This vignette summarises the findings from the 100 days and 100 lines of code workshop, hosted in December 2022 by Epiverse-TRACE.
To answer this question, we invited 40 experts, including academics, field epidemiologists, and software engineers, to take part in a three-day workshop, where they discussed the current challenges and potential solutions in the data analytic pipelines used to analyse epidemic data.
To investigate this in a setting similar to what an outbreak response team would experience, workshop participants were divided into groups and asked to develop a plausible epidemic scenario that included:

- A situation report describing the characteristics of the epidemic
- A linelist of cases and contact tracing data, created by modifying provided datasets containing simulated data
- A set of questions to address during the analytic process

Groups then exchanged epidemic scenarios and analysed the provided data to answer the questions set by the previous group, as if they were a response team working to solve an outbreak. Details about each of these outbreak scenarios and the analytic pipelines developed by the groups are summarised in this vignette.
Before the workshop, a fictitious dataset was created, which consisted of a linelist and contact tracing information.
To generate linelist data, the package bpmodels was used to generate a branching process network. Cases were then transformed from the model output to a linelist format. To add plausible hospitalisations and deaths, delay distributions for SARS-CoV were extracted from epiparameter.
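
The simulation step described above can be sketched in base R. This is a minimal illustration and not the code used to prepare the workshop data: it does not call bpmodels or epiparameter, and the function name `simulate_linelist()`, the offspring parameters, the hospitalisation risk, and the gamma delay parameters are all hypothetical values chosen for illustration rather than SARS-CoV estimates.

```r
# Base-R sketch: simulate a branching process and reshape it into a linelist.
# Every parameter value here is an illustrative assumption.
simulate_linelist <- function(r0 = 1.5, k = 0.5, max_cases = 100,
                              start_date = as.Date("2022-12-01")) {
  id <- 1L                 # case identifiers, starting with the index case
  infector <- NA_integer_  # who infected each case (NA for the index case)
  onset <- 0               # onset time in days since the index case
  i <- 1L
  # Visit cases in order; stop when no cases remain or the cap is reached
  while (i <= length(id) && length(id) < max_cases) {
    # Negative binomial offspring distribution (mean r0, dispersion k)
    n_new <- rnbinom(1, size = k, mu = r0)
    if (n_new > 0) {
      id <- c(id, length(id) + seq_len(n_new))
      infector <- c(infector, rep(id[i], n_new))
      # Hypothetical gamma-distributed serial interval
      onset <- c(onset, onset[i] + rgamma(n_new, shape = 2, rate = 0.4))
    }
    i <- i + 1L
  }
  linelist <- data.frame(id = id, infector = infector,
                         date_onset = start_date + floor(onset))
  # Hypothetical hospitalisation risk and onset-to-admission delay
  hospitalised <- rbinom(nrow(linelist), size = 1, prob = 0.2) == 1
  linelist$date_admission <- linelist$date_onset +
    round(rgamma(nrow(linelist), shape = 2, rate = 0.5))
  linelist$date_admission[!hospitalised] <- NA
  linelist
}

set.seed(1)
linelist <- simulate_linelist()
head(linelist)
```

Each row is one case, with its infector kept alongside the dates so that the transmission tree can still be recovered from the linelist.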
To create the contact tracing database, a random number of contacts was generated for each of the cases included in the linelist. Each contact was then randomly assigned one of three categories: became case, under follow up, or lost to follow up.
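
A minimal base-R sketch of this step is shown below. The function name `make_contacts()`, the Poisson distribution for the number of contacts, and its mean are assumptions made for illustration; the source only states that the number of contacts and their categories were drawn at random.

```r
# Base-R sketch: generate a contact tracing table for a set of cases.
# The Poisson mean of 3 contacts per case is a hypothetical choice.
make_contacts <- function(case_ids, mean_contacts = 3) {
  # Random number of contacts for each case
  n_contacts <- rpois(length(case_ids), lambda = mean_contacts)
  statuses <- c("became case", "under follow up", "lost to follow up")
  data.frame(
    case_id = rep(case_ids, times = n_contacts),
    contact_id = seq_len(sum(n_contacts)),
    # Category assigned uniformly at random to each contact
    status = sample(statuses, size = sum(n_contacts), replace = TRUE)
  )
}

set.seed(2)
contacts <- make_contacts(case_ids = 1:10)
table(contacts$status)
```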
- Data cleaning
- Delay distributions
  - fitdistrplus to fit parametric distributions to scenario data
  - epiparameter to extract delay distributions for respiratory pathogens
  - EpiNow2 to fit reporting delays
  - EpiEstim / coarsedatatools to estimate incubation period of disease
  - epicontacts
  - mixdiff
- Population demographics
  - ColOpenData
- Risk factors of infection
- Severity of disease
  - datadelay for CFR calculation
- Contact matching
- Epi curve and maps
  - incidence and incidence2 for incidence calculation and visualisation
  - rasterR to extract spatial information from library of shapefiles
- Reproduction number
- Superspreading, by using this resource
- Epidemic projections
  - incidence: R estimation using a loglinear model
  - projections using Rt estimates, SI distributions and overdispersion estimates
- Transmission chains and strain characterisation

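
The log-linear approach mentioned for epidemic projections can be illustrated in base R. This sketch fits a linear model to the logarithm of daily counts and does not use the incidence package; the simulated counts and the growth rate of 0.15 per day are made-up values.

```r
set.seed(4)
days <- 0:19
# Simulated daily counts growing exponentially at 0.15 per day (made-up value)
counts <- rpois(length(days), lambda = 5 * exp(0.15 * days))
epicurve <- data.frame(day = days, count = counts)

# Log-linear model: log(count) = intercept + r * day; zero counts are dropped
# because their logarithm is undefined
fit <- lm(log(count) ~ day, data = epicurve[epicurve$count > 0, ])
growth_rate <- unname(coef(fit)["day"])

# Doubling time implied by a constant exponential growth rate
doubling_time <- log(2) / growth_rate
```

Converting the growth rate into a reproduction number additionally requires a serial interval distribution, which is not shown here.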
| Data analysis step | Challenges |
|---|---|
| Data cleaning | Not knowing what packages are available for this purpose |
| Delay distributions | Dealing with right censoring; accounting for multiple infectors |
| Population demographics | Lacking tools that provide information about population by age, gender, etc. |
| Risk factors of infection | Distinguishing between risk factors and differences in reporting frequencies among groups |
| Severity of disease | Knowing the prevalence of disease (denominator); right censoring; varying severity of different strains |
| Contact matching | Missing data; misspellings |
| Epicurve and maps | Entries with NA dates not included; reporting levels varying over time |
| Offspring distribution | Right censoring; time-varying reporting efforts; assumption of a single homogeneous epidemic; importation of cases |
| Forecasting | Underlying assumption of a given R distribution, e.g., single trend, homogeneous mixing, no saturation |
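
The effect of right censoring on severity estimates, listed in the table above, can be shown with a toy example in base R. The data frame below is invented for illustration, and restricting the calculation to cases with a known outcome is only a crude adjustment that can itself be biased; it is not the method implemented in datadelay.

```r
# Toy linelist: NA outcomes are cases whose outcome is not yet known
cases <- data.frame(
  id = 1:8,
  outcome = c("died", "recovered", "recovered", NA, "died", NA, NA, "recovered")
)

# Naive CFR: deaths over all reported cases, treating unresolved cases as survivors
naive_cfr <- sum(cases$outcome == "died", na.rm = TRUE) / nrow(cases)

# CFR restricted to cases with a known outcome
resolved <- !is.na(cases$outcome)
resolved_cfr <- sum(cases$outcome[resolved] == "died") / sum(resolved)

c(naive = naive_cfr, resolved = resolved_cfr)
```

In this toy dataset the naive estimate is 2/8 and the estimate among resolved cases is 2/5, which illustrates how unresolved outcomes pull the naive figure downwards during an ongoing outbreak.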
- fastlink for probabilistic matching between cases ↔ contacts, based on names, dates, and ages
- apyramid to stratify data by age, gender, and health status
- dplyr and data.table
- EpiEstim or EpiNow2
- EpiNow2 to calculate average hospitalisation duration and forecasting

| Data analysis step | Challenges |
|---|---|
| Data anonymisation | Dealing with typos and missing data when generating random unique identifiers |
| Reproduction number | Right censoring; underestimation of cases due to reporting delays |
| Projection of hospital bed requirements | Incomplete data (missing discharge date); undocumented functionality in R packages used |
| Zoonotic transmission | Poor documentation; unavailability of packages in R; differentiation between zoonotic transmission and risk factors (need for population data) |
| Attack rate | Not enough information provided |
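
Matching contacts to cases in the presence of typos can be sketched with base R's `adist()`, which computes edit distances between strings. This is a simple deterministic stand-in and not the probabilistic record linkage model that fastlink implements; the names and the distance threshold `max_dist` are hypothetical.

```r
# Invented names; the contact list contains misspellings of two case names
case_names <- c("Maria Lopez", "John Smith", "Aisha Khan")
contact_names <- c("Jon Smith", "Maria Lopes", "Peter Brown")

# Edit distance matrix: one row per contact, one column per case
d <- adist(tolower(contact_names), tolower(case_names))

# Closest case for each contact, and the corresponding distance
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(contact_names), best)]

# Accept a match only if it is within a hypothetical threshold of 2 edits
max_dist <- 2
matches <- data.frame(
  contact = contact_names,
  matched_case = ifelse(best_dist <= max_dist, case_names[best], NA_character_),
  distance = best_dist
)
matches
```

A fixed threshold is easy to reason about but ignores dates and ages, which is why a probabilistic approach combining several fields may be preferable on real data.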
- rio, readxl, readr, or openxlsx
- janitor
- pointblank, assertr, compareDF, or skimr
- matchmaker, lubridate, or parsedate
- hmatch, assertr, or queryR
- janitor and tidyverse
- dplyr or powerjoin
- matchmaker
- fitdistrplus to fit parametric distributions to epidemic data
- sitrep to generate reports
- gadm to get population data
- epitabulate to describe data
- sf and ggplot2 to plot data

| Data analysis step | Challenges |
|---|---|
| Detection of outliers | No known tools to use |
| Severity of disease | Censoring |
| Spillover events | Missing data |
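
Fitting a parametric distribution to delay data, the task for which fitdistrplus was used, amounts to maximum likelihood estimation. The sketch below does this by hand with base R's `optim()` for a gamma distribution; it does not call fitdistrplus and does not handle censoring. The simulated delays and their true parameters are made up for illustration.

```r
set.seed(3)
# Simulated delays in days (hypothetical true shape = 2, rate = 0.5)
delays <- rgamma(500, shape = 2, rate = 0.5)

# Negative log-likelihood of a gamma distribution; parameters are optimised
# on the log scale so that shape and rate stay positive
neg_loglik <- function(log_par, x) {
  -sum(dgamma(x, shape = exp(log_par[1]), rate = exp(log_par[2]), log = TRUE))
}

fit <- optim(par = c(0, 0), fn = neg_loglik, x = delays)
estimates <- exp(fit$par)
names(estimates) <- c("shape", "rate")
estimates
```

Treating censored observations as exact, as this sketch does, is the source of the bias listed in the table above; accounting for it requires a likelihood that includes the censoring mechanism.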
- epiR to check for data censoring

| Data analysis step | Challenges |
|---|---|
| Data cleaning | No available R packages specific for epidemic data |
| Reproduction number | Difficulty finding parameter estimations in the literature |
| Severity | Missing cases; need for an R package for systematic censoring analysis |
- cookiecutter, reportfactory, and orderly
- epitrix
- fitdistrplus to fit parametric distributions to scenario data
- apyramid to stratify data by age, gender, and health status
- incidence for static reports

| Data analysis step | Challenges |
|---|---|
| Project structure | Working simultaneously on the same script and managing parallel tasks |
| Data cleaning | Losing too many data entries when removing NA rows; non-standardised data |
| Delay distributions | Need to identify the best method to calculate, or compare functionality of tools |
| Severity of disease | Censoring and truncation; underestimation of mild cases |
| Zoonotic transmission | Need for specific packages with clear documentation |
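
The data cleaning challenge in the table above, losing too many entries when removing NA rows, can be illustrated with a toy linelist in base R. The data are invented; the point is that dropping rows only when the columns needed for a given analysis are missing retains more records than dropping every incomplete row.

```r
# Invented linelist with missing values scattered across columns
linelist <- data.frame(
  id = 1:5,
  date_onset = as.Date(c("2022-12-01", NA, "2022-12-03", "2022-12-04", "2022-12-05")),
  age = c(34, 51, NA, 8, 62),
  outcome = c("recovered", NA, "died", NA, "recovered")
)

# Dropping every row with any NA keeps only fully complete records
all_complete <- na.omit(linelist)

# For an epicurve only the onset date is needed, so filter on that column alone
for_epicurve <- linelist[!is.na(linelist$date_onset), ]

c(all_complete = nrow(all_complete), for_epicurve = nrow(for_epicurve))
```

Here the blanket removal keeps 2 of 5 records, while the column-specific filter keeps 4.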